NSF PAR Search | NSF Public Access Repository


Wheeler Graphs and Wheeler Languages

https://doi.org/10.4230/oasics.manzini.12

Cotumaccio, Nicola; D'Agostino, Giovanna; Gibney, Daniel; Policriti, Alberto; Prezza, Nicola; Thankachan, Sharma V (June 2025, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Ferragina, Paolo; Gagie, Travis; Navarro, Gonzalo (Ed.)
Suffix sorting stands at the core of the most efficient solutions for indexed pattern matching: the suffix tree, the suffix array, compressed indexes based on the Burrows-Wheeler transform, and so on. In [Gagie, Manzini, Sirén, TCS 2017] this concept was extended to labeled graphs, obtaining the rich class of Wheeler graphs. This work opened a very fruitful line of research, ultimately generating results able to bridge the fields of compressed data structures, graph theory, and regular language theory. In a Wheeler graph, nodes are sorted according to the alphabetic order of their incoming labels, propagating this order through pairs of equally-labeled edges. This apparently-simple definition makes it possible to solve on Wheeler graphs problems (including, but not limited to: compression, subpath queries, NFA equivalence, determinization, minimization) that on general labeled graphs are extremely hard to solve, and induces a rich structure in the class of regular languages (Wheeler languages) recognized by automata whose state transition is a Wheeler graph. The goal of this survey is to provide a summary of (and intuitions behind) the results on Wheeler graphs that appeared in the literature since their introduction, in addition to a discussion of interesting problems that are still open in the field.
Free, publicly-accessible full text available June 19, 2026
Two-Dimensional Longest Common Extension Queries in Compact Space

https://doi.org/10.4230/LIPICS.STACS.2025.38

Ganguly, Arnab; Gibney, Daniel; Shah, Rahul; Thankachan, Sharma V (January 2025, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Beyersdorff, Olaf; Pilipczuk, Michał; Pimentel, Elaine; Thắng, Nguyễn Kim (Ed.)
For a length n text over an alphabet of size σ, we can encode the suffix tree data structure in 𝒪(n log σ) bits of space. It supports suffix array (SA), inverse suffix array (ISA), and longest common extension (LCE) queries in 𝒪(log^ε_σ n) time, which enables efficient pattern matching; here ε > 0 is an arbitrarily small constant. Further improvements are possible for LCE queries, where 𝒪(1) time queries can be achieved using an index of space 𝒪(n log σ) bits. However, compactly indexing a two-dimensional text (i.e., an n × n matrix) has been a major open problem. We show progress in this direction by first presenting an 𝒪(n² log σ)-bit structure supporting LCE queries in near 𝒪((log_σ n)^{2/3}) time. We then present an 𝒪(n² log σ + n² log log n)-bit structure supporting ISA queries in near 𝒪(log n ⋅ (log_σ n)^{2/3}) time. Within a similar space, achieving SA queries in poly-logarithmic (even strongly sub-linear) time is a significant challenge. However, our 𝒪(n² log σ + n² log log n)-bit structure can support SA queries in 𝒪(n²/(σ log n)^c) time, where c is an arbitrarily large constant, which enables pattern matching in time faster than what is possible without preprocessing. We then design a repetition-aware data structure. The δ_2D compressibility measure for two-dimensional texts was recently introduced by Carfagna and Manzini [SPIRE 2023]. The measure ranges from 1 to n², with smaller δ_2D indicating a highly compressible two-dimensional text. The current data structure utilizing δ_2D allows only element access. We obtain the first structure based on δ_2D for LCE queries. It takes 𝒪̃(n^{5/3} + n^{8/5} δ_2D^{1/5}) space and answers queries in 𝒪(log n) time.
Full Text Available
Efficient Encodings for Privacy-Preserving Data Storage and Transmission

https://doi.org/10.1109/BIGDATA62323.2024.10825141

Das, Arghya Kusum; Oguzhan Kulekci, M; Thankachan, Sharma V (December 2024, IEEE)

Full Text Available
Repetition Aware Text Indexing for Matching Patterns with Wildcards

https://doi.org/10.4230/LIPICS.ICALP.2025.88

Gibney, Daniel; Huffstutler, Jackson; Parthasarathi, Mano Prakash; Thankachan, Sharma V (January 2025, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Censor-Hillel, Keren; Grandoni, Fabrizio; Ouaknine, Joel; Puppis, Gabriele (Ed.)
We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that match P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (c log n)^k / k!) and query time O(m + 2^h log log n + occ), where c > 1 is a constant, and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are as follows: O(δ log(n/δ) ⋅ c^k (1 + (log^k(δ log n))/k!)) and O((m + 2^h + occ) log n). The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log(n/δ)) represents the optimal space required to encode the data in terms of n and δ, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.’s index using O(n + δ log(n/δ) ⋅ (c log δ)^{k+ε} / k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k-gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).
Full Text Available
Text Indexing for Faster Gapped Pattern Matching

https://doi.org/10.3390/a17120537

Hossen, Md Helal; Gibney, Daniel; Thankachan, Sharma V (December 2024, Algorithms)

We revisit the following version of the Gapped String Indexing problem, where the goal is to preprocess a text T[1..n] to enable efficient reporting of all occ occurrences of a gapped pattern P = P₁[α..β]P₂ in T. An occurrence of P in T is defined as a pair (i, j) where substrings T[i..i+|P₁|) and T[j..j+|P₂|) match P₁ and P₂, respectively, with a gap j − (i + |P₁|) lying within the interval [α..β]. This problem has significant applications in computational biology and text mining. A hardness result on this problem suggests that any index with polylogarithmic query time must occupy near quadratic space. In a recent study [STACS 2024], Bille et al. presented a sub-quadratic space index using space Õ(n^{2−δ/3}), where 0 ≤ δ ≤ 1 is a parameter fixed at the time of index construction. Its query time is Õ(|P₁| + |P₂| + n^δ · (1 + occ)), which is sub-linear per occurrence when δ < 1. We show how to achieve a gap-sensitive query time of Õ(|P₁| + |P₂| + n^δ · (1 + occ^{1−δ}) + ∑_{g∈[α..β]} occ_g · g^δ) using the same space, where occ_g denotes the number of occurrences with gap g. This is faster when there are many occurrences with small gaps.
Full Text Available
Near-Optimal Quantum Algorithms for Bounded Edit Distance and Lempel-Ziv Factorization

https://doi.org/10.1137/1.9781611977912.118

Gibney, Daniel; Jin, Ce; Kociumaka, Tomasz; Thankachan, Sharma (January 2024, Society for Industrial and Applied Mathematics)

Full Text Available
Non-overlapping Indexing in BWT-Runs Bounded Space

Gibney, Daniel; Macnichol, Paul; Thankachan, Sharma (September 2023, SPIRE 2023 (LNCS, volume 14240))

Full Text Available
Ranked Document Retrieval in External Memory

https://doi.org/10.1145/3559763

Shah, Rahul; Sheng, Cheng; Thankachan, Sharma; Vitter, Jeffrey (January 2023, ACM Transactions on Algorithms)

The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T₁, T₂, …, T_d} of d strings (called documents) of total length n into a data structure, such that for any given query (P, k), where P is a string (called pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of those k documents that are most relevant to P can be reported, ideally in the sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/log_σ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takes O(n) space and answers queries in O(p/B + log_B n + k/B + log^*(n/B)) I/Os, where B is the block size. The second one takes O(n log^*(n/B)) space and answers queries in optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported in the unsorted order of relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B))-space data structure with optimal query cost.
Full Text Available
On the Hardness of Sequence Alignment on De Bruijn Graphs

https://doi.org/10.1089/cmb.2022.0411

Gibney, Daniel; Thankachan, Sharma V.; Aluru, Srinivas (December 2022, Journal of Computational Biology)

Full Text Available
Algorithms for Colinear Chaining with Overlaps and Gap Costs

https://doi.org/10.1089/cmb.2022.0266

Jain, Chirag; Gibney, Daniel; Thankachan, Sharma V. (November 2022, Journal of Computational Biology)

Full Text Available
